Abstract

Support vector machines have attracted much attention in theo-
retical and in applied statistics. Main topics of recent interest are
consistency, learning rates and robustness. In this article, it is shown 
that support vector machines are qualitatively robust. Since support 
vector machines can be represented by a functional on the set of all 
probability measures, qualitative robustness is proven by showing that 
this functional is continuous with respect to the topology generated by 
weak convergence of probability measures. Combined with the exis- 
tence and uniqueness of support vector machines, our results show that 
support vector machines are the solutions of a well-posed mathematical 
problem in Hadamard's sense. 
1 A Long Introduction

Two of the most important topics in statistics are classification and regression. There, it is assumed that the outcome $y \in \mathcal{Y}$ of a random variable $Y$ (output variable) is influenced by an observed value $x \in \mathcal{X}$ (input variable). On the basis of a finite data set $((x_1, y_1), \dots, (x_n, y_n)) \in (\mathcal{X} \times \mathcal{Y})^n$, the goal is to find an "optimal" predictor $f : \mathcal{X} \to \mathcal{Y}$ which makes a prediction $f(x)$ for an unobserved $y$. In parametric statistics, a signal plus noise relationship

$$y = f_\theta(x) + \varepsilon$$

is often assumed, where $f_\theta$ is precisely known except for a finite parameter $\theta \in \mathbb{R}^p$ and $\varepsilon$ is an error term (generated from a Normal distribution). In this way, the goal of estimating an "optimal" predictor (which can be any function $f : \mathcal{X} \to \mathcal{Y}$) reduces to the much simpler task of estimating the parameter $\theta \in \mathbb{R}^p$. Since, in many applications, such strong assumptions can hardly be justified, nonparametric regression has been developed which avoids (or at least considerably weakens) such assumptions. In statistical machine learning, the method of support vector machines has been developed as a method of nonparametric regression; see e.g., Vapnik (1998), Schölkopf and Smola (2002), and Steinwart and Christmann (2008). There, the estimate of the predictor (called empirical SVM) is a function $f$ which solves the minimization problem

$$\min_{f \in H} \; \frac{1}{n} \sum_{i=1}^{n} L(x_i, y_i, f(x_i)) + \lambda \|f\|_H^2 \qquad (1)$$

where $H$ is a certain function space. The first term in (1) is the empirical mean of the losses caused by the predictions $f(x_i)$ and the second term penalizes the complexity of $f$ in order to avoid overfitting; $\lambda$ is a positive real number, and the space $H$ is a reproducing kernel Hilbert space (RKHS) which consists of functions $f : \mathcal{X} \to \mathbb{R}$.
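As a rough illustration of the objective in (1), the following sketch evaluates the regularized empirical risk for a function of the form $f = \sum_j \alpha_j k(\cdot, x_j)$, for which $\|f\|_H^2 = \alpha^\top K \alpha$ with Gram matrix $K$. The Gaussian kernel, the hinge loss, the toy data and all function names are assumptions chosen for illustration only; the loss here ignores the argument $x$, which is a simplification of the general $L(x, y, t)$.

```python
import math

def gaussian_kernel(x, z, gamma=1.0):
    """Gaussian RBF kernel on the real line (bounded and continuous)."""
    return math.exp(-gamma * (x - z) ** 2)

def hinge_loss(y, t):
    """Hinge loss for labels y in {-1, +1}; illustrative choice of L."""
    return max(0.0, 1.0 - y * t)

def regularized_empirical_risk(alpha, xs, ys, lam,
                               kernel=gaussian_kernel, loss=hinge_loss):
    """Evaluate (1/n) * sum_i L(y_i, f(x_i)) + lam * ||f||_H^2
    for f = sum_j alpha_j * k(., x_j)."""
    n = len(xs)
    gram = [[kernel(xi, xj) for xj in xs] for xi in xs]
    # f(x_i) = sum_j alpha_j * k(x_i, x_j)
    preds = [sum(alpha[j] * gram[i][j] for j in range(n)) for i in range(n)]
    empirical_risk = sum(loss(y, t) for y, t in zip(ys, preds)) / n
    # ||f||_H^2 = alpha^T K alpha for a finite kernel expansion
    rkhs_norm_sq = sum(alpha[i] * gram[i][j] * alpha[j]
                       for i in range(n) for j in range(n))
    return empirical_risk + lam * rkhs_norm_sq

# Hypothetical toy data: two negative and two positive examples.
xs = [-1.0, -0.5, 0.5, 1.0]
ys = [-1, -1, 1, 1]
risk_zero = regularized_empirical_risk([0.0] * 4, xs, ys, lam=0.1)
risk_fit = regularized_empirical_risk([-0.5, -0.5, 0.5, 0.5], xs, ys, lam=0.1)
print(risk_zero, risk_fit)
```

The zero function pays the full hinge loss but no penalty, while the second coefficient vector trades a smaller empirical loss against a positive penalty term; the sketch only evaluates the objective and does not minimize it.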



Since the rise of robust statistics (Tukey (1960), Huber (1964)), it is well-known that imperceptibly small deviations of the real world from model assumptions may lead to arbitrarily wrong conclusions. While many practitioners are aware of the need for robust methods in classical parametric statistics, it is quite often overlooked that robustness is also a crucial issue in nonparametric statistics. For example, the sample mean can be seen as a nonparametric procedure which is non-robust since it is extremely sensitive to outliers: Let $X_1, \dots, X_n$ be i.i.d. random variables with unknown distribution $P$, and let the task be to estimate the expectation of $P$. If the observed data are really generated by the ideal $P$ (and if expectation and variance of $P$ exist), then the sample mean is the optimal estimator. However, it frequently happens in the real world that, due to outliers or small model violations, the observed data are not generated by the ideal $P$ but by another distribution $P'$. Even if $P'$ is close to the ideal $P$, the sample mean may lead to disastrous results. Detailed descriptions and some examples of such effects are given in Tukey (1960), Huber (1964), and Huber (1981, §1.1).
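The sensitivity of the sample mean can be made concrete with a small numerical sketch. The data values below are invented for illustration, and the median is included only as a familiar comparator; it is not discussed in the text above.

```python
import statistics

# Hypothetical clean sample centred around 10.
clean = [9.8, 10.1, 9.9, 10.2, 10.0, 9.7, 10.3, 10.0, 9.9, 10.1]

# Replace a single observation by a gross outlier.
contaminated = clean[:-1] + [1000.0]

# How far does each estimate move when one of ten points is corrupted?
mean_shift = abs(statistics.mean(contaminated) - statistics.mean(clean))
median_shift = abs(statistics.median(contaminated) - statistics.median(clean))

print(f"shift of the mean:   {mean_shift:.2f}")
print(f"shift of the median: {median_shift:.2f}")
```

One corrupted observation out of ten moves the mean by roughly a tenth of the outlier's size, and the shift can be made arbitrarily large by enlarging the outlier, whereas the median stays put in this example.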

In nonparametric regression, similar effects can occur. There, it is often assumed that $(X_1, Y_1), \dots, (X_n, Y_n)$ are i.i.d. random variables with unknown distribution $P$. This distribution $P$ determines in which way the output variable $Y_i$ is influenced by the input variable $X_i$. However, estimating a predictor $f : \mathcal{X} \to \mathcal{Y}$ can be severely distorted if the observed data $(x_1, y_1), \dots, (x_n, y_n)$ are - just as usual - not generated by $P$ but by another distribution $P'$ which may be close to the ideal $P$. In order to safeguard against severe distortions, an estimator $S_n$ should fulfill some kind of continuity: If the real distribution $P'$ is close to the ideal distribution $P$, then the distribution of the estimator $S_n$ should hardly be affected (uniformly in the sample sizes $n \in \mathbb{N}$). This kind of robustness is called qualitative robustness and has been formalized in Hampel (1968, 1971) for estimators taking values in $\mathbb{R}^p$.

In order to study this notion of robust statistics for support vector machines, we need a generalization given by Cuevas (1988) of this formalization because, here, the values of the estimator are functions $f : \mathcal{X} \to \mathcal{Y}$ which are elements of a (typically infinite dimensional) Hilbert space $H$. In case of support vector machines, the estimators

$$S_n : (\mathcal{X} \times \mathcal{Y})^n \to H$$

can be represented by a functional

$$S : \mathcal{M}_1(\mathcal{X} \times \mathcal{Y}) \to H$$

on the set $\mathcal{M}_1(\mathcal{X} \times \mathcal{Y})$ of all probability measures on $\mathcal{X} \times \mathcal{Y}$:

$$S_n\big((x_1, y_1), \dots, (x_n, y_n)\big) = S\Big(\frac{1}{n} \sum_{i=1}^{n} \delta_{(x_i, y_i)}\Big)$$

for every $(x_1, y_1), \dots, (x_n, y_n) \in \mathcal{X} \times \mathcal{Y}$, where $\frac{1}{n} \sum_{i=1}^{n} \delta_{(x_i, y_i)}$ is the empirical measure and $\delta_{(x_i, y_i)}$ denotes the Dirac measure in $(x_i, y_i)$. It is shown by Cuevas (1988) that, in such cases, the qualitative robustness of a sequence of estimators $(S_n)_{n \in \mathbb{N}}$ follows from the continuity of the functional $S$ (with respect to the topology of weak convergence of probability measures). While quantitative robustness of support vector machines has already been investigated by means of Hampel's influence functions and bounds for the maxbias in Christmann and Steinwart (2007) and by means of Bouligand influence functions in Christmann and Van Messem (2008), results about qualitative robustness of support vector machines have not been published so far. The goal of this paper is to fill this gap in research on qualitative robustness of support vector machines.

The structure of the article is as follows: In Section 2, we recall the basic setup concerning support vector machines, define the functional $S$ which represents the SVM-estimators $S_n$, $n \in \mathbb{N}$, and quote the mathematical definition of qualitative robustness. In Section 3, we show that the functional $S$ of support vector machines is, in fact, continuous under very mild assumptions (Theorem 3.2). In this way, it is also proven that, under the same assumptions, support vector machines are qualitatively robust (Theorem 3.1). In addition, it follows that empirical support vector machines are continuous in the data - i.e., they are hardly affected by slight changes in the data (Corollary 3.4). Under somewhat different assumptions, this has already been shown in Steinwart and Christmann (2008, Lemma 5.13). Section 4 contains some concluding remarks. All proofs are given in the Appendix.

It has to be pointed out that our results show that support vector machines are qualitatively robust with a fixed regularization parameter $\lambda \in (0, \infty)$. If the fixed regularization parameter $\lambda$ is replaced by a sequence of parameters $\lambda_n \in (0, \infty)$ which decreases to 0 with increasing sample size $n$, then support vector machines are not qualitatively robust any more under extremely mild conditions. This is demonstrated in Section 5.2 in the Appendix. From our point of view, this is an important result as all universal consistency proofs we know of for support vector machines or for their risks use an appropriate null sequence $\lambda_n \in (0, \infty)$, $n \in \mathbb{N}$.



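2 Support Vector Machines and Qualitative Robustness

Let $(\Omega, \mathcal{A}, Q)$ be a probability space, let $\mathcal{X}$ be a Polish space with Borel-$\sigma$-algebra $\mathfrak{B}(\mathcal{X})$ and let $\mathcal{Y}$ be a closed subset of $\mathbb{R}$ with Borel-$\sigma$-algebra $\mathfrak{B}(\mathcal{Y})$. The Borel-$\sigma$-algebra of $\mathcal{X} \times \mathcal{Y}$ is denoted by $\mathfrak{B}(\mathcal{X} \times \mathcal{Y})$ and the set of all probability measures on $(\mathcal{X} \times \mathcal{Y}, \mathfrak{B}(\mathcal{X} \times \mathcal{Y}))$ is denoted by $\mathcal{M}_1(\mathcal{X} \times \mathcal{Y})$. Let

$$X_1, \dots, X_n : (\Omega, \mathcal{A}, Q) \to (\mathcal{X}, \mathfrak{B}(\mathcal{X}))$$

and

$$Y_1, \dots, Y_n : (\Omega, \mathcal{A}, Q) \to (\mathcal{Y}, \mathfrak{B}(\mathcal{Y}))$$

be random variables such that $(X_1, Y_1), \dots, (X_n, Y_n)$ are independent and identically distributed according to some unknown probability measure $P \in \mathcal{M}_1(\mathcal{X} \times \mathcal{Y})$.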

A measurable map $L : \mathcal{X} \times \mathcal{Y} \times \mathbb{R} \to [0, \infty)$ is called a loss function. It is assumed that $L(x, y, y) = 0$ for every $(x, y) \in \mathcal{X} \times \mathcal{Y}$ - that is, the loss is zero if the prediction $f(x)$ equals the observed value $y$. In addition, we will assume that

$$L(x, y, \cdot) : \mathbb{R} \to [0, \infty), \quad t \mapsto L(x, y, t)$$

is convex for every $(x, y) \in \mathcal{X} \times \mathcal{Y}$ and that the following uniform Lipschitz property is fulfilled for a positive real number $|L|_1 \in (0, \infty)$:

$$\sup_{(x,y) \in \mathcal{X} \times \mathcal{Y}} |L(x, y, t) - L(x, y, t')| \le |L|_1 \cdot |t - t'| \quad \forall\, t, t' \in \mathbb{R}. \qquad (2)$$

We restrict our attention to Lipschitz continuous loss functions because the use of loss functions which are not Lipschitz continuous (such as the least squares loss on unbounded domains) usually conflicts with several notions of robustness; see, e.g., Steinwart and Christmann (2008, § 10.4).
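The Lipschitz property (2) can be probed numerically. The sketch below computes the largest difference quotient in $t$ over a grid for a few standard losses; the particular parametrizations (Huber threshold $c = 1$, pinball level $\tau = 0.5$), the grid and the helper name `max_difference_quotient` are illustrative assumptions, and a grid check is of course no proof.

```python
def hinge(y, t):
    return max(0.0, 1.0 - y * t)

def huber(y, t, c=1.0):
    r = abs(y - t)
    return 0.5 * r * r if r <= c else c * (r - 0.5 * c)

def pinball(y, t, tau=0.5):
    r = y - t
    return tau * r if r >= 0 else (tau - 1.0) * r

def least_squares(y, t):
    return (y - t) ** 2

def max_difference_quotient(loss, y, grid):
    """Largest |L(y,a) - L(y,b)| / |a - b| over neighbouring grid points."""
    return max(abs(loss(y, a) - loss(y, b)) / abs(a - b)
               for a, b in zip(grid, grid[1:]))

# Predictions t on a grid over [-50, 50]; observed value y = 1.
grid = [i * 0.1 - 50.0 for i in range(1001)]
slopes = {loss.__name__: max_difference_quotient(loss, 1.0, grid)
          for loss in (hinge, huber, pinball, least_squares)}

for name, slope in slopes.items():
    print(f"{name:14s} {slope:8.3f}")
```

For hinge, Huber and pinball loss the quotient stays bounded by a constant that does not depend on the grid range, whereas for the least squares loss it grows with the range of $t$, in line with the remark that least squares is not Lipschitz continuous on unbounded domains.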
The risk of a measurable function $f : \mathcal{X} \to \mathbb{R}$ is defined by

$$\mathcal{R}_{L,P}(f) = \int_{\mathcal{X} \times \mathcal{Y}} L(x, y, f(x)) \, P(d(x, y)) .$$

Let $k : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ be a bounded and continuous kernel with reproducing kernel Hilbert space (RKHS) $H$. See e.g. Schölkopf and Smola (2002) or Steinwart and Christmann (2008) for details about these concepts. Note that $H$ is a Polish space since every Hilbert space is complete and, according to Steinwart and Christmann (2008, Lemma 4.29), $H$ is separable. Furthermore, every $f \in H$ is a bounded and continuous function $f : \mathcal{X} \to \mathbb{R}$; see Steinwart and Christmann (2008, Lemma 4.28). In particular, every $f \in H$ is measurable and its regularized risk is defined to be

$$\mathcal{R}_{L,P,\lambda}(f) = \mathcal{R}_{L,P}(f) + \lambda \|f\|_H^2 .$$

An element $f \in H$ is called a support vector machine and denoted by $f_{L,P,\lambda}$ if it minimizes the regularized risk in $H$. That is,

$$\mathcal{R}_{L,P}(f_{L,P,\lambda}) + \lambda \|f_{L,P,\lambda}\|_H^2 = \inf_{f \in H} \mathcal{R}_{L,P}(f) + \lambda \|f\|_H^2 .$$

We would like to consider a functional

$$S : P \mapsto f_{L,P,\lambda} . \qquad (3)$$



However, support vector machines $f_{L,P,\lambda}$ need not exist for every probability measure $P \in \mathcal{M}_1(\mathcal{X} \times \mathcal{Y})$ and, therefore, $S$ cannot be defined on $\mathcal{M}_1(\mathcal{X} \times \mathcal{Y})$ in this way. A sufficient condition for existence of a support vector machine based on a bounded kernel $k$ is, for example, $\mathcal{R}_{L,P}(0) < \infty$; see Steinwart and Christmann (2008, Corollary 5.3). In order to enlarge the applicability of support vector machines, the following extension has been developed in Christmann et al. (2009). Following an idea already used by Huber (1967) for M-estimates in parametric models, a shifted loss function $L^* : \mathcal{X} \times \mathcal{Y} \times \mathbb{R} \to \mathbb{R}$ is defined by

$$L^*(x, y, t) = L(x, y, t) - L(x, y, 0) \quad \forall\, (x, y, t) \in \mathcal{X} \times \mathcal{Y} \times \mathbb{R} .$$

Then, similar to the original loss function $L$, define the $L^*$-risk by

$$\mathcal{R}_{L^*,P}(f) = \int L^*(x, y, f(x)) \, P(d(x, y))$$

and the regularized $L^*$-risk by

$$\mathcal{R}_{L^*,P,\lambda}(f) = \mathcal{R}_{L^*,P}(f) + \lambda \|f\|_H^2$$

for every $f \in H$. In complete analogy to $f_{L,P,\lambda}$, we define the support vector machine based on the shifted loss function $L^*$ by

$$f_{L^*,P,\lambda} = \arg\inf_{f \in H} \mathcal{R}_{L^*,P}(f) + \lambda \|f\|_H^2 .$$
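Why the shift helps can be seen in a small sketch. For a Lipschitz loss, $|L^*(x, y, t)| \le |L|_1 \cdot |t|$ regardless of how large $y$ is, while $L(x, y, t)$ itself may be huge for extreme $y$. The absolute distance loss (with $|L|_1 = 1$), the sample values and the helper name `shifted` are assumptions made for illustration.

```python
def absolute_loss(y, t):
    """Absolute distance loss |y - t|; Lipschitz in t with constant 1."""
    return abs(y - t)

def shifted(loss):
    """Return the shifted loss L*(y, t) = L(y, t) - L(y, 0)."""
    def shifted_loss(y, t):
        return loss(y, t) - loss(y, 0.0)
    return shifted_loss

l_star = shifted(absolute_loss)

# Increasingly extreme observed values y, as under a heavy-tailed P.
ys = [10.0 ** k for k in range(0, 13, 3)]
t = 2.5  # a fixed prediction

raw_values = [absolute_loss(y, t) for y in ys]
shifted_values = [l_star(y, t) for y in ys]

for y, raw, shift in zip(ys, raw_values, shifted_values):
    print(f"y = {y:>16.1f}   L = {raw:>16.1f}   L* = {shift:>5.1f}")
```

The raw loss grows without bound in $y$, so its expectation may be infinite, whereas the shifted loss is bounded by $|t|$ in absolute value; this is the mechanism that lets the $L^*$-risk be finite for every probability measure when $f$ is bounded.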



The following theorem summarizes some basic results derived by Christmann et al. (2009):

Theorem 2.1 For any $P \in \mathcal{M}_1(\mathcal{X} \times \mathcal{Y})$, there exists a unique $f_{L^*,P,\lambda} \in H$ which minimizes $\mathcal{R}_{L^*,P,\lambda}$, i.e.

$$\mathcal{R}_{L^*,P}(f_{L^*,P,\lambda}) + \lambda \|f_{L^*,P,\lambda}\|_H^2 = \inf_{f \in H} \mathcal{R}_{L^*,P}(f) + \lambda \|f\|_H^2 .$$

If a support vector machine $f_{L,P,\lambda} \in H$ exists (which minimizes $\mathcal{R}_{L,P,\lambda}$ in $H$), then

$$f_{L^*,P,\lambda} = f_{L,P,\lambda} .$$

According to this theorem, the map

$$S : \mathcal{M}_1(\mathcal{X} \times \mathcal{Y}) \to H, \quad P \mapsto f_{L^*,P,\lambda}$$

exists, is uniquely defined and extends the functional in (3). Therefore, $S$ may be called SVM-functional.

In order to estimate a measurable map $f : \mathcal{X} \to \mathbb{R}$ which minimizes the risk

$$\mathcal{R}_{L,P}(f) = \int_{\mathcal{X} \times \mathcal{Y}} L(x, y, f(x)) \, P(d(x, y)) ,$$

the SVM-estimator is defined by

$$S_n : (\mathcal{X} \times \mathcal{Y})^n \to H, \quad D_n \mapsto f_{L,D_n,\lambda}$$

where $f_{L,D_n,\lambda}$ is that function $f \in H$ which minimizes

$$\frac{1}{n} \sum_{i=1}^{n} L(x_i, y_i, f(x_i)) + \lambda \|f\|_H^2$$

in $H$ for $D_n = ((x_1, y_1), \dots, (x_n, y_n)) \in (\mathcal{X} \times \mathcal{Y})^n$. Let $P_{D_n}$ be the empirical measure corresponding to the data $D_n$ for sample size $n \in \mathbb{N}$. Then, the definitions given above yield

$$f_{L,D_n,\lambda} = S_n(D_n) = S(P_{D_n}) = f_{L,P_{D_n},\lambda} . \qquad (4)$$

Note that the support vector machine uniquely exists for every empirical measure. In particular, this also implies $f_{L,D_n,\lambda} = f_{L^*,P_{D_n},\lambda}$.

The main goal of the article is to show that, under very mild conditions, the sequence of SVM-estimators $(S_n)_{n \in \mathbb{N}}$ is qualitatively robust. According to Cuevas (1988, Definition 1), the sequence $(S_n)_{n \in \mathbb{N}}$ is called qualitatively robust if the functions

$$\mathcal{M}_1(\mathcal{X} \times \mathcal{Y}) \to \mathcal{M}_1(H), \quad P \mapsto S_n(P^n), \quad n \in \mathbb{N},$$

are uniformly continuous with respect to the weak topologies on $\mathcal{M}_1(\mathcal{X} \times \mathcal{Y})$ and $\mathcal{M}_1(H)$. Here, $\mathcal{M}_1(H)$ denotes the set of all probability measures on $(H, \mathfrak{B}(H))$, $\mathfrak{B}(H)$ is the Borel-$\sigma$-algebra on $H$, and $S_n(P^n)$ denotes the image measure of $P^n$ with respect to $S_n$. Hence, $S_n(P^n)$ is the measure on $(H, \mathfrak{B}(H))$ which is defined by

$$\big(S_n(P^n)\big)(F) = P^n\big(\{ D_n \in (\mathcal{X} \times \mathcal{Y})^n \mid S_n(D_n) \in F \}\big)$$

for every Borel-measurable subset $F \subset H$. Of course, this definition only makes sense if the SVM-estimators are measurable with respect to the Borel-$\sigma$-algebras. This measurability is assured by Corollary 3.4 below.

Figure 1: Sketch: reasoning of robustness of $S(P)$. Left: $P$, a neighborhood of $P$, and $\mathcal{M}_1(\mathcal{X} \times \mathcal{Y})$. Right: $S(P)$, a neighborhood of $S(P)$, and the space of all probability measures of $S(P)$ for $P \in \mathcal{M}_1(\mathcal{X} \times \mathcal{Y})$.

Since the weak topologies on $\mathcal{M}_1(\mathcal{X} \times \mathcal{Y})$ and $\mathcal{M}_1(H)$ are metrizable by the Prokhorov metric $d_{\mathrm{Pro}}$ (see Subsection 5.1), the sequence of SVM-estimators $(S_n)_{n \in \mathbb{N}}$ is qualitatively robust if and only if for every $P \in \mathcal{M}_1(\mathcal{X} \times \mathcal{Y})$ and every $\rho > 0$ there is an $\varepsilon > 0$ such that

$$d_{\mathrm{Pro}}(Q, P) < \varepsilon \;\;\Longrightarrow\;\; d_{\mathrm{Pro}}\big(S_n(Q^n), S_n(P^n)\big) < \rho \quad \forall\, n \in \mathbb{N} .$$

Roughly speaking, qualitative robustness means that the SVM-estimator tolerates two kinds of errors in the data: small errors in many observations $(x_i, y_i)$ and large errors in a small fraction of the data set. These two kinds of errors only have slight effects on the distribution and, therefore, on the performance of the SVM-estimator (uniformly in the sample size). Figure 1 gives a graphical illustration of qualitative robustness.

3 Main Results 

The following theorem is our main result and shows that support vector 
machines are qualitatively robust under mild conditions. 

Theorem 3.1 Let $\mathcal{X}$ be a Polish space and let $\mathcal{Y}$ be a closed subset of $\mathbb{R}$. Let the loss function be a continuous function $L : \mathcal{X} \times \mathcal{Y} \times \mathbb{R} \to [0, \infty)$ such that $L(x, y, y) = 0$ for every $(x, y) \in \mathcal{X} \times \mathcal{Y}$ and

$$L(x, y, \cdot) : \mathbb{R} \to [0, \infty), \quad t \mapsto L(x, y, t)$$

is convex for every $(x, y) \in \mathcal{X} \times \mathcal{Y}$. Assume that the uniform Lipschitz property

$$\sup_{(x,y) \in \mathcal{X} \times \mathcal{Y}} |L(x, y, t) - L(x, y, t')| \le |L|_1 \cdot |t - t'| \quad \forall\, t, t' \in \mathbb{R}$$

is fulfilled for a real number $|L|_1 \in (0, \infty)$. Furthermore, let $k : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ be a bounded and continuous kernel with RKHS $H$.

Then, the sequence of SVM-estimators $(S_n)_{n \in \mathbb{N}}$ is qualitatively robust.

Of course, this theorem applies to classification (e.g. $\mathcal{Y} = \{-1, 1\}$) and regression (e.g. $\mathcal{Y} = \mathbb{R}$ or $\mathcal{Y} = [0, \infty)$). In particular, note that every function $g : \mathcal{Y} \to \mathbb{R}$ is continuous if $\mathcal{Y}$ is a discrete set - e.g. $\mathcal{Y} = \{-1, 1\}$. In this case, assuming $L$ to be continuous reduces to the assumption that

$$\mathcal{X} \times \mathbb{R} \to [0, \infty), \quad (x, t) \mapsto L(x, y, t)$$

is continuous for every $y \in \mathcal{Y}$. Many of the most common loss functions are permitted in the theorem, e.g. the hinge loss and logistic loss for classification, $\varepsilon$-insensitive loss and Huber's loss for regression, and the pinball loss for quantile regression. The least squares loss is ruled out in Theorem 3.1 - which is not surprising as it is the prominent standard example of a loss function which typically conflicts with robustness if $\mathcal{X}$ and $\mathcal{Y}$ are unbounded; see, e.g., Christmann and Steinwart (2007) and Christmann and Van Messem (2008). Assuming continuity of the kernel $k$ does not seem to be very restrictive as all of the most common kernels are continuous. Assuming $k$ to be bounded is quite natural in order to ensure good robustness properties. While the Gaussian RBF kernel is always bounded, polynomial kernels (except for the constant kernel) and the exponential kernel are bounded if and only if $\mathcal{X}$ is bounded.
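The boundedness claim can be checked on the diagonal of the kernel, since a kernel is bounded exactly when $x \mapsto k(x, x)$ is bounded. The following sketch compares a Gaussian RBF kernel with a quadratic polynomial kernel on the real line; the parameter values and function names are illustrative assumptions.

```python
import math

def gaussian_rbf(x, z, gamma=1.0):
    """Gaussian RBF kernel exp(-gamma * (x - z)^2)."""
    return math.exp(-gamma * (x - z) ** 2)

def polynomial(x, z, degree=2, c=1.0):
    """Polynomial kernel (x * z + c)^degree."""
    return (x * z + c) ** degree

# Points moving further and further away from the origin.
points = [10.0 ** k for k in range(7)]

rbf_diagonal = [gaussian_rbf(x, x) for x in points]
poly_diagonal = [polynomial(x, x) for x in points]

for x, r, p in zip(points, rbf_diagonal, poly_diagonal):
    print(f"x = {x:>10.0f}   k_rbf(x, x) = {r:.1f}   k_poly(x, x) = {p:.3e}")
```

The Gaussian diagonal is constantly 1, so the kernel is bounded on all of $\mathbb{R}$, while the polynomial diagonal grows without bound unless the input space itself is bounded.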

In our definition of the sequence $(S_n)_{n \in \mathbb{N}}$ of SVM-estimators, the regularization parameter $\lambda$ is a fixed real number which does not change with $n$. Instead, it is also common to consider sequences of estimators

$$T_n : (\mathcal{X} \times \mathcal{Y})^n \to H, \quad D_n \mapsto f_{L,D_n,\lambda_n}, \quad n \in \mathbb{N},$$

where the fixed parameter $\lambda$ is replaced by a sequence $(\lambda_n)_{n \in \mathbb{N}} \subset (0, \infty)$ with $\lim_{n \to \infty} \lambda_n = 0$. However, Theorem 3.1 cannot be generalized to $(T_n)_{n \in \mathbb{N}}$. Proposition 5.2 (in the Appendix) shows under extremely mild conditions that $(T_n)_{n \in \mathbb{N}}$ is not qualitatively robust. This is of interest because appropriately chosen null sequences $(\lambda_n)_{n \in \mathbb{N}} \subset (0, \infty)$ are used to prove universal consistency of the risk, $\mathcal{R}_{L^*,P}(f_{L^*,D_n,\lambda_n}) \to \inf_{f \in \mathcal{F}} \mathcal{R}_{L^*,P}(f)$, and $f_{L^*,D_n,\lambda_n} \to \arg\inf_{f \in \mathcal{F}} \mathcal{R}_{L^*,P}(f)$ for $n \to \infty$, where $\mathcal{F}$ denotes the set of all measurable functions $f : \mathcal{X} \to \mathbb{R}$. This was first shown by Steinwart (2002), Zhang (2004), and Steinwart (2005). We also refer to Bousquet and Elisseeff (2002), Bartlett et al. (2006), Christmann et al. (2009), and Steinwart and Anghel (2009).



The proof of Theorem 3.1 is based on the following result which is interesting on its own.

Theorem 3.2 Under the assumptions of Theorem 3.1, the SVM-functional

$$S : \mathcal{M}_1(\mathcal{X} \times \mathcal{Y}) \to H, \quad P \mapsto f_{L^*,P,\lambda}$$

is continuous with respect to the weak topology on $\mathcal{M}_1(\mathcal{X} \times \mathcal{Y})$ and the norm topology on $H$.



As a generalization of earlier results by, e.g., Zhang (2001), De Vito et al. (2004), and Steinwart (2003), Christmann et al. (2009, Theorem 7) derived a representer theorem which showed that, for every $P_0 \in \mathcal{M}_1(\mathcal{X} \times \mathcal{Y})$, there is a bounded map $h : \mathcal{X} \times \mathcal{Y} \to \mathbb{R}$ such that $f_{L^*,P_0,\lambda} = -\frac{1}{2\lambda} \int h\Phi \, dP_0$ and

$$\big\| f_{L^*,P,\lambda} - f_{L^*,P_0,\lambda} \big\|_H \le \lambda^{-1} \Big\| \int h\Phi \, dP - \int h\Phi \, dP_0 \Big\|_H \qquad (5)$$

for every $P \in \mathcal{M}_1(\mathcal{X} \times \mathcal{Y})$. The integrals in (5) are Bochner integrals of the vector-valued function $h\Phi : \mathcal{X} \times \mathcal{Y} \to H$, $(x, y) \mapsto h(x, y)\Phi(x)$, where $\Phi$ is the canonical feature map of $k$, i.e. $\Phi(x) = k(\cdot, x)$ for all $x \in \mathcal{X}$. This offers an elegant possibility of proving Theorem 3.2 if we would accept some additional assumptions: The statement of Theorem 3.2 is true if $\int h\Phi \, dP_n$ converges to $\int h\Phi \, dP_0$ for every weakly convergent sequence $P_n \to P_0$. In the following, we show that the integrals indeed converge - under the additional assumptions that the derivative $\frac{\partial L}{\partial t}(x, y, t)$ exists and is continuous for every $(x, y, t) \in \mathcal{X} \times \mathcal{Y} \times \mathbb{R}$. These assumptions are fulfilled e.g. for the logistic loss function and Huber's loss function. In this case, it follows from Christmann et al. (2009, Theorem 7) that $h$ is continuous. Since $\Phi$ is continuous and bounded (see e.g. Steinwart and Christmann (2008, p. 124 and Lemma 4.29)), the integrand $h\Phi : \mathcal{X} \times \mathcal{Y} \to H$ is continuous and bounded. Then, it follows from Bourbaki (2004, p. III.40) that $\int h\Phi \, dP_n$ converges to $\int h\Phi \, dP_0$ for every weakly convergent sequence $P_n \to P_0$ - just as in case of real-valued integrands; see Subsection 5.1 in the Appendix.

Unfortunately, this short proof only works under the additional assumption of a continuous partial derivative $\frac{\partial L}{\partial t}$, and this assumption rules out many loss functions used in practice, such as hinge, absolute distance and $\varepsilon$-insensitive for regression and pinball for quantile regression. Therefore, our proof of Theorem 3.2 (without this additional assumption) does not use the representer theorem and Bochner integrals; it is mainly based on the theory of Hilbert spaces and weak convergence of measures. In the following, we give some corollaries of Theorem 3.2.

Let $C_b(\mathcal{X})$ be the Banach space of all bounded, continuous functions $f : \mathcal{X} \to \mathbb{R}$ with norm

$$\|f\|_\infty = \sup_{x \in \mathcal{X}} |f(x)| .$$

Since $k$ is continuous and bounded, we immediately get from Theorem 3.2 and Steinwart and Christmann (2008, Lemma 4.28):



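Corollary 3.3 Under the assumptions of Theorem 3.1, the SVM-functional

$$\mathcal{M}_1(\mathcal{X} \times \mathcal{Y}) \to C_b(\mathcal{X}), \quad P \mapsto f_{L^*,P,\lambda}$$

is continuous with respect to the weak topology on $\mathcal{M}_1(\mathcal{X} \times \mathcal{Y})$ and the norm topology on $C_b(\mathcal{X})$.

That is, $\sup_{x \in \mathcal{X}} |f_{L^*,P',\lambda}(x) - f_{L^*,P,\lambda}(x)|$ is small if $P'$ is close to $P$.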



The next corollary is similar to Steinwart and Christmann (2008, Lemma 5.13) but only assumes continuity instead of differentiability of $t \mapsto L(x, y, t)$. In combination with existence and uniqueness of support vector machines (see Theorem 2.1), this result shows that a support vector machine is the solution of a well-posed mathematical problem in the sense of Hadamard (1902).

Corollary 3.4 Under the assumptions of Theorem 3.1, the SVM-estimator

$$S_n : (\mathcal{X} \times \mathcal{Y})^n \to H, \quad D_n \mapsto f_{L,D_n,\lambda}$$

is continuous.

In particular, it follows from Corollary 3.4 that the SVM-estimator $S_n$ is measurable.

Remark 3.5 Let $d_n$ be a metric which generates the topology on $(\mathcal{X} \times \mathcal{Y})^n$, e.g. the Euclidean metric on $\mathbb{R}^{n(k+1)}$ if $\mathcal{X} \subset \mathbb{R}^k$. Then Corollary 3.4 and Steinwart and Christmann (2008, Lemma 4.28) imply the following continuity property of the SVM-estimator: For every $\varepsilon > 0$ and every data set $D_n \in (\mathcal{X} \times \mathcal{Y})^n$, there is a $\delta > 0$ such that

$$\sup_{x \in \mathcal{X}} |f_{L,D'_n,\lambda}(x) - f_{L,D_n,\lambda}(x)| < \varepsilon$$

if $D'_n \in (\mathcal{X} \times \mathcal{Y})^n$ is any other data set with $n$ observations and $d_n(D'_n, D_n) < \delta$.

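We finish this section with a corollary about strong consistency of support vector machines which arises as a by-product of Theorem 3.2. Often, asymptotic results of support vector machines show the convergence in probability of the risk $\mathcal{R}_{L^*,P}(f_{L^*,\mathbf{D}_n,\lambda_n})$ to the Bayes risk $\inf_{f \in \mathcal{F}} \mathcal{R}_{L^*,P}(f)$ and of $f_{L^*,\mathbf{D}_n,\lambda_n}$ to $\arg\inf_{f \in \mathcal{F}} \mathcal{R}_{L^*,P}(f)$, where $\mathcal{F}$ is the set of all measurable functions $f : \mathcal{X} \to \mathbb{R}$ and $(\lambda_n)_{n \in \mathbb{N}}$ is a suitable null sequence. In contrast to that, the following corollary provides for fixed $\lambda \in (0, \infty)$ almost sure convergence of $\mathcal{R}_{L^*,P}(f_{L^*,\mathbf{D}_n,\lambda})$ to $\mathcal{R}_{L^*,P}(f_{L^*,P,\lambda})$ and of $f_{L^*,\mathbf{D}_n,\lambda}$ to $f_{L^*,P,\lambda}$. This is an interesting fact, although the limit $\mathcal{R}_{L^*,P}(f_{L^*,P,\lambda})$ will in general differ from the Bayes risk.

Recall from Section 2 that the data points $(x_i, y_i)$ from the data set $D_n = ((x_1, y_1), \dots, (x_n, y_n))$ are realizations of i.i.d. random variables

$$(X_i, Y_i) : (\Omega, \mathcal{A}, Q) \to \big(\mathcal{X} \times \mathcal{Y}, \mathfrak{B}(\mathcal{X} \times \mathcal{Y})\big), \quad i \in \mathbb{N},$$

such that

$$(X_i, Y_i) \sim P \quad \forall\, i \in \mathbb{N} .$$

Corollary 3.6 Define the random vectors

$$\mathbf{D}_n := \big((X_1, Y_1), \dots, (X_n, Y_n)\big)$$

and the corresponding $H$-valued random functions

$$f_{L^*,\mathbf{D}_n,\lambda} = \arg\inf_{f \in H} \frac{1}{n} \sum_{i=1}^{n} L^*\big(X_i, Y_i, f(X_i)\big) + \lambda \|f\|_H^2 , \quad n \in \mathbb{N} .$$

From the assumptions of Theorem 3.1, it follows that

(a) $\lim_{n \to \infty} \|f_{L^*,\mathbf{D}_n,\lambda} - f_{L^*,P,\lambda}\|_H = 0$ almost surely

(b) $\lim_{n \to \infty} \sup_{x \in \mathcal{X}} |f_{L^*,\mathbf{D}_n,\lambda}(x) - f_{L^*,P,\lambda}(x)| = 0$ almost surely

(c) $\lim_{n \to \infty} \mathcal{R}_{L^*,P,\lambda}(f_{L^*,\mathbf{D}_n,\lambda}) = \mathcal{R}_{L^*,P,\lambda}(f_{L^*,P,\lambda})$ almost surely

(d) $\lim_{n \to \infty} \mathcal{R}_{L^*,P}(f_{L^*,\mathbf{D}_n,\lambda}) = \mathcal{R}_{L^*,P}(f_{L^*,P,\lambda})$ almost surely.

If the support vector machine $f_{L,P,\lambda}$ exists, then assertions (a)-(d) are also valid for $L$ instead of $L^*$.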

4 Conclusions 

It is well-known that outliers in data sets or other moderate model viola- 
tions can pose a serious problem to a statistical analysis. On the one hand, 
practitioners can hardly guarantee that their data sets do not contain any 
outliers, while, on the other hand, many statistical methods are very sensi- 
tive even to small violations of the assumed statistical model. Since support 
vector machines play an important role in statistical machine learning, investigating their performance in the presence of moderate model violations is a crucial topic - the more so as support vector machines are frequently applied to large and complex high-dimensional data sets.

In this article, we showed that support vector machines are qualitatively robust with a fixed regularization parameter $\lambda \in (0, \infty)$, i.e., the performance of support vector machines is hardly affected by the following two kinds of errors: large errors in a small fraction of the data set and small errors in the whole data set. This not only means that these errors do not lead to large errors in the support vector machines but also that even the finite sample distribution of support vector machines is hardly affected.

In contrast to that, we also showed that support vector machines are not qualitatively robust any more under extremely mild conditions, if the fixed regularization parameter $\lambda$ is replaced by a sequence of parameters $\lambda_n \in (0, \infty)$ which decreases to 0 with increasing sample size $n$. From our point of view, this is an important result as all universal consistency proofs we know of for support vector machines or for their risks use an appropriate null sequence $\lambda_n \in (0, \infty)$, $n \in \mathbb{N}$.



5 Appendix 



In Subsection 5.1, we briefly recall some facts about weak convergence of probability measures. In addition, we show that weak convergence of probability measures on a Polish space implies convergence of the corresponding Bochner integrals of bounded, continuous functions. Subsection 5.2 demonstrates under extremely mild conditions that the sequence of SVM-estimators cannot be qualitatively robust if the fixed regularization parameter $\lambda$ is replaced by a sequence $(\lambda_n)_{n \in \mathbb{N}} \subset (0, \infty)$ with $\lim_{n \to \infty} \lambda_n = 0$. Subsection 5.3 contains all proofs.



5.1 Weak Convergence of Probability Measures and Bochner 
Integrals 

Let $Z$ be a Polish space with Borel-$\sigma$-algebra $\mathfrak{B}(Z)$, let $d$ be a metric on $Z$ which generates the topology on $Z$ and let $\mathcal{M}_1(Z)$ be the set of all probability measures on $(Z, \mathfrak{B}(Z))$.

A sequence $(P_n)_{n \in \mathbb{N}}$ of probability measures on $Z$ converges to a probability measure $P_0$ in the weak topology on $\mathcal{M}_1(Z)$ if

$$\lim_{n \to \infty} \int g \, dP_n = \int g \, dP_0 \quad \forall\, g \in C_b(Z)$$

where $C_b(Z)$ denotes the set of all bounded, continuous functions $g : Z \to \mathbb{R}$; see Billingsley (1968, § 1). The weak topology on $\mathcal{M}_1(Z)$ is metrizable by the Prokhorov metric $d_{\mathrm{Pro}}$; see e.g. Huber (1981, §2.2). The Prokhorov metric $d_{\mathrm{Pro}}$ on $\mathcal{M}_1(Z)$ is defined by

$$d_{\mathrm{Pro}}(P_1, P_2) = \inf\big\{ \varepsilon \in (0, \infty) \;\big|\; P_1(B) < P_2(B^\varepsilon) + \varepsilon \;\; \forall\, B \in \mathfrak{B}(Z) \big\}$$

where $B^\varepsilon = \{ z \in Z \mid \inf_{z' \in B} d(z, z') < \varepsilon \}$.
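For finitely supported measures on the real line, this definition can be evaluated by brute force, which makes the two kinds of data errors mentioned in Section 2 tangible. The sketch below is only an illustration under stated assumptions: measures are represented as dictionaries `{point: mass}`, only subsets $B$ of the support of $P_1$ are checked (which suffices because enlarging $B$ outside the support does not change $P_1(B)$ and can only enlarge $B^\varepsilon$), and the infimum is located by bisection using that the defining condition is monotone in $\varepsilon$. The function names are hypothetical.

```python
from itertools import combinations

def condition_holds(p1, p2, eps):
    """Check P1(B) < P2(B^eps) + eps for all nonempty B within supp(P1)."""
    support = list(p1)
    for size in range(1, len(support) + 1):
        for subset in combinations(support, size):
            mass1 = sum(p1[z] for z in subset)
            # Mass that P2 puts on the open eps-neighbourhood of B.
            mass2 = sum(w for z2, w in p2.items()
                        if any(abs(z2 - z) < eps for z in subset))
            if not mass1 < mass2 + eps:
                return False
    return True

def prokhorov_distance(p1, p2, tol=1e-9):
    """Approximate the infimum in the definition by bisection on (0, 1]."""
    lo, hi = 0.0, 1.0  # the condition holds for every eps > 1
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if condition_holds(p1, p2, mid):
            hi = mid
        else:
            lo = mid
    return hi

# Small error in every observation: all mass moved from 0 to 0.3.
small_shift = prokhorov_distance({0.0: 1.0}, {0.3: 1.0})

# Large error in a small fraction: 10% of the mass moved far away.
outlier = prokhorov_distance({0.0: 1.0}, {0.0: 0.9, 100.0: 0.1})
far_outlier = prokhorov_distance({0.0: 1.0}, {0.0: 0.9, 1e6: 0.1})

print(small_shift, outlier, far_outlier)
```

In this toy setting the distance reflects the size of a shift applied to all mass, and the contaminated fraction when a small part of the mass is moved arbitrarily far; the magnitude of the outlier itself does not matter. The enumeration is exponential in the support size and is meant only for very small examples.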

Let $g : Z \to \mathbb{R}$ be a continuous and bounded function. By definition, we have $\lim_{n \to \infty} \int g \, dP_n = \int g \, dP_0$ for every sequence $(P_n)_{n \in \mathbb{N}} \subset \mathcal{M}_1(Z)$ which converges weakly in $\mathcal{M}_1(Z)$ to some $P_0$. The following theorem states that this is still valid for Bochner integrals if $g$ is replaced by a vector-valued continuous and bounded function $\Psi : Z \to H$, where $H$ is a separable Banach space. This follows from a corresponding statement in Bourbaki (2004, p. III.40) for locally compact spaces $Z$. Boundedness of $\Psi$ means that $\sup_{z \in Z} \|\Psi(z)\|_H < \infty$.

Theorem 5.1 Let $Z$ be a Polish space with Borel-$\sigma$-algebra $\mathfrak{B}(Z)$ and let $H$ be a separable Banach space. If $\Psi : Z \to H$ is a continuous and bounded function, then

$$\int \Psi \, dP_n \;\to\; \int \Psi \, dP_0 \quad (n \to \infty)$$

for every sequence $(P_n)_{n \in \mathbb{N}} \subset \mathcal{M}_1(Z)$ which converges weakly in $\mathcal{M}_1(Z)$ to some $P_0$.

5.2 A Counterexample 



Theorem 3.1 shows that, for a fixed regularization parameter $\lambda \in (0, \infty)$, the sequence of SVM-estimators

$$S_n : (\mathcal{X} \times \mathcal{Y})^n \to H, \quad D_n \mapsto f_{L,D_n,\lambda}, \quad n \in \mathbb{N},$$

is qualitatively robust. The following proposition shows that, under extremely mild conditions, the sequence of estimators

$$T_n : (\mathcal{X} \times \mathcal{Y})^n \to H, \quad D_n \mapsto f_{L,D_n,\lambda_n}, \quad n \in \mathbb{N},$$

cannot be qualitatively robust if the fixed parameter $\lambda$ is replaced by a sequence $(\lambda_n)_{n \in \mathbb{N}} \subset (0, \infty)$ with $\lim_{n \to \infty} \lambda_n = 0$. This shows that the asymptotic results on universal consistency of support vector machines - which consider appropriate null sequences $(\lambda_n)_{n \in \mathbb{N}} \subset (0, \infty)$ - are in conflict with qualitative robustness of support vector machines using $\lambda_n$. (Asymptotic results on universal consistency of support vector machines can be found, e.g., in the references listed before Theorem 3.2.)

For simplicity, the following proposition focuses on regression because it is assumed that $\{0, 1\} \subset \mathcal{Y}$. A similar proposition (with a similar proof) can also be given in case of binary classification where $\mathcal{Y} = \{-1, 1\}$.

Proposition 5.2 Let $\mathcal{X}$ be a Polish space and let $\mathcal{Y}$ be a closed subset of $\mathbb{R}$ such that $\{0, 1\} \subset \mathcal{Y}$. Let $k$ be a bounded kernel with RKHS $H$. Let $L$ be a convex loss function such that $L(x, y, y) = 0$ for every $(x, y) \in \mathcal{X} \times \mathcal{Y}$. In addition, assume that there are $x_0, x_1 \in \mathcal{X}$ such that

$$\exists\, f \in H : \; f(x_0) = 0, \;\; f(x_1) \ne 0 \qquad (6)$$

$$L(x_1, 1, 0) > 0 . \qquad (7)$$

Let $(\lambda_n)_{n \in \mathbb{N}} \subset (0, \infty)$ be any sequence such that $\lim_{n \to \infty} \lambda_n = 0$. Then, the sequence of estimators

$$T_n : (\mathcal{X} \times \mathcal{Y})^n \to H, \quad D_n \mapsto f_{L,D_n,\lambda_n}, \quad n \in \mathbb{N},$$

is not qualitatively robust.



5.3 Proofs 



In order to prove the main theorem, i.e. Theorem 3.1, we have to prove Theorem 3.2 and Corollary 3.4 first.

Proof of Theorem 3.2. Since the proof is somewhat involved, we start with a short outline. The proof is divided into four parts. Part 1 is concerned with some important preparations. We have to show that $(f_{L^*,P_n,\lambda})_{n \in \mathbb{N}}$ converges to $f_{L^*,P_0,\lambda}$ in $H$ if the sequence of probability measures $(P_n)_{n \in \mathbb{N}}$ weakly converges to the probability measure $P_0$. Let us now assume that there is a subsequence $(f_{L^*,P_{n_\ell},\lambda})_{\ell \in \mathbb{N}}$ of $(f_{L^*,P_n,\lambda})_{n \in \mathbb{N}}$ which weakly converges to $f_{L^*,P_0,\lambda}$ in $H$. Then, it is shown in Part 2 and Part 3 that

$$\lim_{\ell \to \infty} \mathcal{R}_{L^*,P_{n_\ell}}(f_{L^*,P_{n_\ell},\lambda}) = \mathcal{R}_{L^*,P_0}(f_{L^*,P_0,\lambda}) \qquad (8)$$

$$\lim_{\ell \to \infty} \mathcal{R}_{L^*,P_{n_\ell},\lambda}(f_{L^*,P_{n_\ell},\lambda}) = \mathcal{R}_{L^*,P_0,\lambda}(f_{L^*,P_0,\lambda}) . \qquad (9)$$

Because of

$$\|f\|_H^2 = \frac{1}{\lambda} \big( \mathcal{R}_{L^*,P,\lambda}(f) - \mathcal{R}_{L^*,P}(f) \big) \quad \forall\, P \in \mathcal{M}_1(\mathcal{X} \times \mathcal{Y}) \;\; \forall\, f \in H,$$

it follows from (8) and (9) that $\lim_{\ell \to \infty} \|f_{L^*,P_{n_\ell},\lambda}\|_H = \|f_{L^*,P_0,\lambda}\|_H$. Since this convergence of the norms together with weak convergence in the Hilbert space $H$ implies (strong) convergence in $H$, we get that the subsequence $(f_{L^*,P_{n_\ell},\lambda})_{\ell \in \mathbb{N}}$ converges to $f_{L^*,P_0,\lambda}$ in $H$. Part 4 extends this result to the whole sequence $(f_{L^*,P_n,\lambda})_{n \in \mathbb{N}}$. The main difficulty in the proof is the verification of (8) in Part 3.
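The step from norm convergence plus weak convergence to strong convergence used in this outline is a standard Hilbert space fact. For convenience, it can be spelled out as follows, writing $g_\ell$ for the subsequence and $g_0$ for its weak limit (these two symbols are introduced here only as shorthand):

```latex
\begin{align*}
\|g_\ell - g_0\|_H^2
  &= \|g_\ell\|_H^2 - 2\,\langle g_\ell, g_0 \rangle_H + \|g_0\|_H^2 \\
  &\xrightarrow[\ell \to \infty]{}\;
     \|g_0\|_H^2 - 2\,\langle g_0, g_0 \rangle_H + \|g_0\|_H^2 \;=\; 0 ,
\end{align*}
% first term: convergence of the norms;
% second term: weak convergence applied to the fixed element g_0.
```

The first term converges by the convergence of the norms, and the inner product converges because weak convergence means $\langle g_\ell, g \rangle_H \to \langle g_0, g \rangle_H$ for every fixed $g \in H$, in particular for $g = g_0$.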

In order to shorten notation, define

$$L^*_f : \mathcal{X} \times \mathcal{Y} \to \mathbb{R}, \quad (x, y) \mapsto L^*(x, y, f(x)) = L(x, y, f(x)) - L(x, y, 0)$$

for every measurable $f : \mathcal{X} \to \mathbb{R}$. Following e.g. van der Vaart (1998) and Pollard (2002), we use the notation

$$Pg = \int g \, dP$$

for integrals of real-valued functions $g$ with respect to $P$. This leads to a very efficient notation which is more intuitive here because, in the following, $P$ rather acts as a linear functional on a function space than as a probability measure on a $\sigma$-algebra.

By use of these notations, we may write

$$P L^*_f = \int L^*_f \, dP = \mathcal{R}_{L^*,P}(f)$$

for the (shifted) risk of $f \in H$. Accordingly, the (shifted) regularized risk of $f \in H$ is

$$\mathcal{R}_{L^*,P,\lambda}(f) = \mathcal{R}_{L^*,P}(f) + \lambda \|f\|_H^2 = P L^*_f + \lambda \|f\|_H^2 .$$



Part 1 : Since the loss function L , the shifted loss L* and the regulariza- 
tion parameter A G (0, oo) are fixed, we may drop them in the notation and 
write 

fp ■= h*,p,x = S(P) VP g Mi(x x y) . 



Recall from Theorem 2.1 that fi* p \ is equal to the support vector machine 



Il,p,x if /l,p,a exists. That is, we have fp = Jl,p,\ in the latter case. 
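According to Christmann et al. (2009, (17), (16)),

\[ \|f_P\|_\infty \le \frac{1}{\lambda}\,|L|_1\,\|k\|_\infty^2 \tag{10} \]
\[ \|f_P\|_H \le \Bigl(\frac{1}{\lambda}\,|L|_1 \int |f_P| \, dP\Bigr)^{1/2} \le \frac{1}{\lambda}\,|L|_1\,\|k\|_\infty \tag{11} \]

for every $P \in \mathcal{M}_1(\mathcal{X}\times\mathcal{Y})$. Since the kernel $k$ is continuous and bounded,
Steinwart and Christmann (2008, Lemma 4.28) yields

\[ f \in C_b(\mathcal{X}) \qquad \forall\, f \in H . \tag{12} \]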
Therefore, continuity of $L$ implies continuity of

\[ L^*_f : \mathcal{X}\times\mathcal{Y} \to \mathbb{R}, \qquad (x,y) \mapsto L(x,y,f(x)) - L(x,y,0) \]

for every $f \in H$. Furthermore, the uniform Lipschitz property of $L$ implies

\begin{align*}
\sup_{x,y} |L^*_f(x,y)| &= \sup_{x,y} |L(x,y,f(x)) - L(x,y,0)| \\
&\le \sup_{x',x,y} |L(x,y,f(x')) - L(x,y,0)| \le \sup_{x'} |L|_1 \cdot |f(x') - 0| = |L|_1 \cdot \|f\|_\infty
\end{align*}

for every $f \in H$. Hence, we obtain

\[ L^*_f \in C_b(\mathcal{X}\times\mathcal{Y}) \qquad \forall\, f \in H . \tag{13} \]

In particular, the above calculation and (10) imply

\[ \|L^*_{f_P}\|_\infty \le \frac{1}{\lambda}\,|L|_1^2\,\|k\|_\infty^2 \qquad \forall\, P \in \mathcal{M}_1(\mathcal{X}\times\mathcal{Y}) . \tag{14} \]



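For the remaining parts of the proof, let $(P_n)_{n\in\mathbb{N}_0} \subset \mathcal{M}_1(\mathcal{X}\times\mathcal{Y})$ be
any fixed sequence such that

\[ P_n \longrightarrow P_0 \qquad (n\to\infty) \]

in the weak topology on $\mathcal{M}_1(\mathcal{X}\times\mathcal{Y})$, that is,

\[ \lim_{n\to\infty} P_n g = P_0 g \qquad \forall\, g \in C_b(\mathcal{X}\times\mathcal{Y}) . \tag{15} \]

In particular, (13) and (15) imply

\[ \lim_{n\to\infty} P_n L^*_f = P_0 L^*_f \qquad \forall\, f \in H . \tag{16} \]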



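In order to shorten the notation, define

\[ f_n := f_{P_n} = f_{L^*,P_n,\lambda} = S(P_n) \qquad \forall\, n \in \mathbb{N}\cup\{0\} . \]

Hence, we have to show that $(f_n)_{n\in\mathbb{N}}$ converges to $f_0$ in $H$, that is,

\[ \lim_{n\to\infty} \|f_n - f_0\|_H = 0 . \tag{17} \]

Part 2: In this part of the proof, it is shown that

\[ \limsup_{n\to\infty} \, P_n L^*_{f_n} + \lambda\|f_n\|_H^2 \;\le\; P_0 L^*_{f_0} + \lambda\|f_0\|_H^2 . \tag{18} \]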



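Due to (13), the mapping

\[ \mathcal{M}_1(\mathcal{X}\times\mathcal{Y}) \to \mathbb{R}, \qquad P \mapsto PL^*_f + \lambda\|f\|_H^2 \]

is well defined and continuous for every $f \in H$. Being the (pointwise)
infimum over a family of continuous functions, the function

\[ \mathcal{M}_1(\mathcal{X}\times\mathcal{Y}) \to \mathbb{R}, \qquad P \mapsto \inf_{f\in H} \bigl(PL^*_f + \lambda\|f\|_H^2\bigr) \]

is upper semicontinuous; see, e.g., Denkowski et al. (2003, Prop. 1.1.36).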
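The upper semicontinuity of the infimum can also be verified directly along a weakly convergent sequence $P_n \to P_0$. In the short derivation below, $\Phi_f(P) := PL^*_f + \lambda\|f\|_H^2$ is a shorthand introduced only for this illustration, and $g \in H$ is arbitrary but fixed:

```latex
\[
\limsup_{n\to\infty}\, \inf_{f\in H} \Phi_f(P_n)
\;\le\; \limsup_{n\to\infty} \Phi_g(P_n)
\;=\; \Phi_g(P_0)
\qquad\Longrightarrow\qquad
\limsup_{n\to\infty}\, \inf_{f\in H} \Phi_f(P_n)
\;\le\; \inf_{g\in H} \Phi_g(P_0) .
\]
```

The equality uses the continuity of $P \mapsto \Phi_g(P)$ for fixed $g$, and the implication follows by taking the infimum over $g$ on the right-hand side.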
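Therefore, the definition of $f_n$ implies

\begin{align*}
\limsup_{n\to\infty} \bigl(P_n L^*_{f_n} + \lambda\|f_n\|_H^2\bigr)
&= \limsup_{n\to\infty} \, \inf_{f\in H} \bigl(P_n L^*_f + \lambda\|f\|_H^2\bigr) \\
&\le \inf_{f\in H} \bigl(P_0 L^*_f + \lambda\|f\|_H^2\bigr) = P_0 L^*_{f_0} + \lambda\|f_0\|_H^2 .
\end{align*}

Part 3: In this part of the proof, the following statement is shown:
Let $(f_{n_\ell})_{\ell\in\mathbb{N}}$ be a subsequence of $(f_n)_{n\in\mathbb{N}}$ and assume that $(f_{n_\ell})_{\ell\in\mathbb{N}}$
converges weakly in $H$ to some $f_0' \in H$. Then, the following three assertions
are true:

\[ \lim_{\ell\to\infty} P_{n_\ell} L^*_{f_{n_\ell}} = P_0 L^*_{f_0'} \tag{19} \]
\[ f_0' = f_0 \tag{20} \]
\[ \lim_{\ell\to\infty} \|f_{n_\ell} - f_0\|_H = 0 . \tag{21} \]

In order to prove this, we will also have to deal with subsequences of the
subsequence $(f_{n_\ell})_{\ell\in\mathbb{N}}$. As this would lead to a somewhat cumbersome notation,
we define

\[ P_\ell' := P_{n_\ell} \quad\text{and}\quad f_\ell' := f_{n_\ell} \qquad \forall\, \ell \in \mathbb{N} . \]

Thus, $f_\ell' = f_{L^*,P_{n_\ell},\lambda}$ for every $\ell \in \mathbb{N}$. Then, the assumption of weak
convergence in the Hilbert space $H$ reads

\[ \lim_{\ell\to\infty} \langle f_\ell', h\rangle_H = \langle f_0', h\rangle_H \qquad \forall\, h \in H . \tag{22} \]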



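First of all, we show (19) by proving

\[ \limsup_{\ell\to\infty} \bigl| P_\ell' L^*_{f_\ell'} - P_0 L^*_{f_0'} \bigr| \le \varepsilon_0 \tag{23} \]

for every fixed $\varepsilon_0 > 0$. In order to do this, fix any $\varepsilon_0 > 0$ and define

\[ \varepsilon := \frac{\varepsilon_0}{|L|_1 \cdot \bigl(\frac{1}{\lambda}|L|_1 \cdot \|k\|_\infty^2 + \|f_0'\|_\infty\bigr)} > 0 . \tag{24} \]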

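The following calculation shows that the sequence of functions $(f_\ell')_{\ell\in\mathbb{N}}$ is
uniformly continuous on $\mathcal{X}$. For any convergent sequence $x_m \to x_0$ in $\mathcal{X}$,
we have

\begin{align*}
\limsup_{m\to\infty} \sup_{\ell\in\mathbb{N}} |f_\ell'(x_m) - f_\ell'(x_0)|
&= \limsup_{m\to\infty} \sup_{\ell\in\mathbb{N}} \bigl|\langle f_\ell', \Phi(x_m)\rangle_H - \langle f_\ell', \Phi(x_0)\rangle_H\bigr| \\
&= \limsup_{m\to\infty} \sup_{\ell\in\mathbb{N}} \bigl|\langle f_\ell', \Phi(x_m) - \Phi(x_0)\rangle_H\bigr| \\
&\le \limsup_{m\to\infty} \sup_{\ell\in\mathbb{N}} \|f_\ell'\|_H \cdot \|\Phi(x_m) - \Phi(x_0)\|_H \\
&\overset{(11)}{\le} \frac{1}{\lambda}\,|L|_1 \cdot \|k\|_\infty \cdot \limsup_{m\to\infty} \|\Phi(x_m) - \Phi(x_0)\|_H = 0
\end{align*}

where the first equality follows from the properties of the RKHS $H$ and the
last equality follows from Steinwart and Christmann (2008, Lemma 4.29).

Since $\mathcal{X}\times\mathcal{Y}$ is a Polish space, weak convergence of $(P_\ell')_{\ell\in\mathbb{N}}$ implies
uniform tightness of $(P_\ell')_{\ell\in\mathbb{N}}$ (see e.g. Dudley (1989, Theorem 11.5.3)). That
is, there is a compact subset $K_\varepsilon \subset \mathcal{X}\times\mathcal{Y}$ such that

\[ \limsup_{\ell\to\infty} P_\ell'(K_\varepsilon^c) < \varepsilon . \tag{25} \]

Since $K_\varepsilon$ is compact and the projection

\[ \tau_{\mathcal{X}} : \mathcal{X}\times\mathcal{Y} \to \mathcal{X}, \qquad (x,y) \mapsto x \]

is continuous, $\tilde K_\varepsilon := \tau_{\mathcal{X}}(K_\varepsilon)$ is compact in $\mathcal{X}$. For every $\ell \in \mathbb{N}_0$, the
restriction of $f_\ell'$ on $\tilde K_\varepsilon$ is denoted by $\tilde f_\ell'$. As the sequence $(f_\ell')_{\ell\in\mathbb{N}}$ is uniformly
continuous on $\mathcal{X}$ and uniformly bounded in $C_b(\mathcal{X})$ (see (10)), the sequence
of the restrictions $(\tilde f_\ell')_{\ell\in\mathbb{N}}$ has the corresponding properties on $\tilde K_\varepsilon$. That is,
$(\tilde f_\ell')_{\ell\in\mathbb{N}}$ is uniformly continuous on $\tilde K_\varepsilon$ and uniformly bounded in $C_b(\tilde K_\varepsilon)$.
Hence, the Arzelà-Ascoli Theorem (see Conway (1985, Theorem VI.3.8))
assures that $(\tilde f_\ell')_{\ell\in\mathbb{N}}$ is totally bounded and, therefore, relatively compact
in $C_b(\tilde K_\varepsilon)$ (since $C_b(\tilde K_\varepsilon)$ is a complete metric space); see e.g. Dunford and
Schwartz (1958, Theorem I.6.15).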
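The uniform continuity estimate feeding into the Arzelà-Ascoli step rests on the bound $|f(x) - f(x')| \le \|f\|_H \cdot \|\Phi(x) - \Phi(x')\|_H$, where $\|\Phi(x) - \Phi(x')\|_H^2 = k(x,x) - 2k(x,x') + k(x',x')$ for the canonical feature map $\Phi$. The following minimal numerical sketch checks this bound under assumptions that are not prescribed by the proof: a Gaussian kernel on the real line and a hypothetical $f$ given by a finite kernel expansion.

```python
import math


def k(x, z, gamma=0.5):
    """Gaussian kernel (an assumption of this sketch; any bounded continuous kernel would do)."""
    return math.exp(-gamma * (x - z) ** 2)


def feature_distance(x, z):
    """||Phi(x) - Phi(z)||_H computed from the kernel alone."""
    return math.sqrt(max(k(x, x) - 2.0 * k(x, z) + k(z, z), 0.0))


# Hypothetical element f = sum_j alpha_j k(., c_j) of the RKHS.
centers = [-1.0, 0.0, 2.0]
alpha = [0.7, -1.2, 0.4]


def f(x):
    return sum(a * k(x, c) for a, c in zip(alpha, centers))


# RKHS norm of f: sqrt(alpha^T K alpha).
norm_f = math.sqrt(sum(ai * aj * k(ci, cj)
                       for ai, ci in zip(alpha, centers)
                       for aj, cj in zip(alpha, centers)))


def bound_holds(x, z, tol=1e-12):
    """Check |f(x) - f(z)| <= ||f||_H * ||Phi(x) - Phi(z)||_H up to rounding."""
    return abs(f(x) - f(z)) <= norm_f * feature_distance(x, z) + tol
```

Since the right-hand side depends on $f$ only through $\|f\|_H$, a common bound on the norms, such as (11), yields the same modulus of continuity for every function in the sequence.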



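The following reasoning shows that $(\tilde f_\ell')_{\ell\in\mathbb{N}}$ converges to $\tilde f_0'$ in $C_b(\tilde K_\varepsilon)$,
i.e.

\[ \lim_{\ell\to\infty} \sup_{x\in\tilde K_\varepsilon} |\tilde f_\ell'(x) - \tilde f_0'(x)| = 0 . \tag{26} \]

We will show (26) by contradiction. If (26) is not true, then there is a $\delta > 0$
and a subsequence $(\tilde f_{\ell_j}')_{j\in\mathbb{N}}$ such that

\[ \sup_{x\in\tilde K_\varepsilon} |\tilde f_{\ell_j}'(x) - \tilde f_0'(x)| > \delta \qquad \forall\, j \in \mathbb{N} . \tag{27} \]

Relative compactness of $(\tilde f_\ell')_{\ell\in\mathbb{N}}$ implies that there is a further subsequence
$(\tilde f_{\ell_{j_m}}')_{m\in\mathbb{N}}$ which converges in $C_b(\tilde K_\varepsilon)$ to some $h_0 \in C_b(\tilde K_\varepsilon)$. Then,

\[ h_0(x) = \lim_{m\to\infty} \tilde f_{\ell_{j_m}}'(x) = \lim_{m\to\infty} \langle f_{\ell_{j_m}}', \Phi(x)\rangle_H \overset{(22)}{=} \langle f_0', \Phi(x)\rangle_H = \tilde f_0'(x) \]

for every $x \in \tilde K_\varepsilon$. That is, $\tilde f_0'$ is the limit of $(\tilde f_{\ell_{j_m}}')_{m\in\mathbb{N}}$, which is the desired
contradiction to (27). Therefore, (26) is true.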

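Now, we can prove (23): Firstly, the triangle inequality and the Lipschitz
continuity of $L$ yield

\begin{align*}
\limsup_{\ell\to\infty} \bigl|P_\ell' L^*_{f_\ell'} - P_0 L^*_{f_0'}\bigr|
&\le \limsup_{\ell\to\infty} \Bigl( \bigl|P_\ell' L^*_{f_\ell'} - P_\ell' L^*_{f_0'}\bigr| + \bigl|P_\ell' L^*_{f_0'} - P_0 L^*_{f_0'}\bigr| \Bigr) \\
&\overset{(16)}{=} \limsup_{\ell\to\infty} \bigl|P_\ell' L^*_{f_\ell'} - P_\ell' L^*_{f_0'}\bigr| \\
&\le \limsup_{\ell\to\infty} \int \bigl|L(x,y,f_\ell'(x)) - L(x,y,f_0'(x))\bigr| \, P_\ell'(d(x,y)) \\
&\le \limsup_{\ell\to\infty} \int |L|_1 \cdot |f_\ell'(x) - f_0'(x)| \, P_\ell'(d(x,y)) \\
&= |L|_1 \cdot \limsup_{\ell\to\infty} \Bigl( \int_{K_\varepsilon} |f_\ell'(x) - f_0'(x)| \, P_\ell'(d(x,y)) + \int_{K_\varepsilon^c} |f_\ell'(x) - f_0'(x)| \, P_\ell'(d(x,y)) \Bigr) .
\end{align*}

Secondly, using $\tilde K_\varepsilon = \tau_{\mathcal{X}}(K_\varepsilon)$, we obtain

\begin{align*}
\limsup_{\ell\to\infty} \int_{K_\varepsilon} |f_\ell'(x) - f_0'(x)| \, P_\ell'(d(x,y))
&\le \limsup_{\ell\to\infty} \sup_{(x,y)\in K_\varepsilon} |f_\ell'(x) - f_0'(x)| \\
&= \limsup_{\ell\to\infty} \sup_{x\in\tilde K_\varepsilon} |\tilde f_\ell'(x) - \tilde f_0'(x)| \overset{(26)}{=} 0 .
\end{align*}

Thirdly,

\begin{align*}
\limsup_{\ell\to\infty} \int_{K_\varepsilon^c} |f_\ell'(x) - f_0'(x)| \, P_\ell'(d(x,y))
&\le \limsup_{\ell\to\infty} P_\ell'(K_\varepsilon^c) \cdot \bigl(\|f_\ell'\|_\infty + \|f_0'\|_\infty\bigr) \\
&\overset{(25)}{\le} \limsup_{\ell\to\infty} \varepsilon \cdot \bigl(\|f_\ell'\|_\infty + \|f_0'\|_\infty\bigr)
\overset{(10)}{\le} \varepsilon \cdot \Bigl(\frac{1}{\lambda}|L|_1\|k\|_\infty^2 + \|f_0'\|_\infty\Bigr) .
\end{align*}

Combining these three calculations with the definition (24) of $\varepsilon$ proves (23). Since $\varepsilon_0 > 0$ was arbitrarily
chosen in (23), this proves (19).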



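Next, we prove (20): Due to weak convergence of $(f_{n_\ell})_{\ell\in\mathbb{N}}$ in $H$, it follows
from Conway (1985, Exercise V.1.9) that

\[ \|f_0'\|_H \le \liminf_{\ell\to\infty} \|f_{n_\ell}\|_H . \tag{28} \]

Therefore, the definition of $f_0 = f_{L^*,P_0,\lambda}$ implies

\begin{align*}
P_0 L^*_{f_0} + \lambda\|f_0\|_H^2
&= \inf_{f\in H} P_0 L^*_f + \lambda\|f\|_H^2 \\
&\le P_0 L^*_{f_0'} + \lambda\|f_0'\|_H^2 \\
&\overset{(19),(28)}{\le} \liminf_{\ell\to\infty} \, P_{n_\ell} L^*_{f_{n_\ell}} + \lambda\|f_{n_\ell}\|_H^2 \\
&\le \limsup_{\ell\to\infty} \, P_{n_\ell} L^*_{f_{n_\ell}} + \lambda\|f_{n_\ell}\|_H^2
\overset{(18)}{\le} P_0 L^*_{f_0} + \lambda\|f_0\|_H^2 .
\end{align*}

Due to this calculation, it follows that

\[ P_0 L^*_{f_0} + \lambda\|f_0\|_H^2 = \inf_{f\in H} P_0 L^*_f + \lambda\|f\|_H^2 = P_0 L^*_{f_0'} + \lambda\|f_0'\|_H^2 \tag{29} \]

and

\[ P_0 L^*_{f_0} + \lambda\|f_0\|_H^2 = \lim_{\ell\to\infty} P_{n_\ell} L^*_{f_{n_\ell}} + \lambda\|f_{n_\ell}\|_H^2 . \tag{30} \]

According to Theorem 2.1, $f_0 = f_{L^*,P_0,\lambda}$ is the unique minimizer of the
function

\[ H \to \mathbb{R}, \qquad f \mapsto P_0 L^*_f + \lambda\|f\|_H^2 \]

and, therefore, (29) implies $f_0' = f_0$, i.e. (20).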

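Completing Part 3 of the proof, (21) is shown now:

\begin{align*}
\lim_{\ell\to\infty} \|f_{n_\ell}\|_H^2
&= \lim_{\ell\to\infty} \frac{1}{\lambda} \Bigl( \bigl(P_{n_\ell} L^*_{f_{n_\ell}} + \lambda\|f_{n_\ell}\|_H^2\bigr) - P_{n_\ell} L^*_{f_{n_\ell}} \Bigr) \\
&\overset{(19),(20),(30)}{=} \frac{1}{\lambda} \Bigl( \bigl(P_0 L^*_{f_0} + \lambda\|f_0\|_H^2\bigr) - P_0 L^*_{f_0} \Bigr) = \|f_0\|_H^2 .
\end{align*}

By assumption, the sequence $(f_{n_\ell})_{\ell\in\mathbb{N}}$ converges weakly to some $f_0' \in H$ and
by (20), we know that $f_0' = f_0$. In addition, we have proven $\lim_{\ell\to\infty} \|f_{n_\ell}\|_H =
\|f_0\|_H$ now. This convergence of the norms together with weak convergence
implies strong convergence in the Hilbert space $H$; see, e.g., Conway (1985,
Exercise V.1.8). That is, we have proven (21).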



Part 4: In this final part of the proof, (17) is shown. This is done by
contradiction: If (17) is not true, there is an $\varepsilon > 0$ and a subsequence
$(f_{n_\ell})_{\ell\in\mathbb{N}}$ of $(f_n)_{n\in\mathbb{N}}$ such that

\[ \|f_{n_\ell} - f_0\|_H > \varepsilon \qquad \forall\, \ell \in \mathbb{N} . \tag{31} \]

According to (11), $(f_{n_\ell})_{\ell\in\mathbb{N}} = (f_{P_{n_\ell}})_{\ell\in\mathbb{N}}$ is bounded in $H$. Hence, the
sequence $(f_{n_\ell})_{\ell\in\mathbb{N}}$ contains a further subsequence that weakly converges in
$H$ to some $f_0'$; see e.g. Dunford and Schwartz (1958, Corollary IV.4.7).
Without loss of generality, we may therefore assume that $(f_{n_\ell})_{\ell\in\mathbb{N}}$ weakly
converges in $H$ to some $f_0'$. (Otherwise, we can choose another subsequence
in (31).) Next, it follows from Part 3 that $(f_{n_\ell})_{\ell\in\mathbb{N}}$ strongly converges in
$H$ to $f_0$, which is a contradiction to (31). □



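Proof of Corollary 3.4: Let $(D_{n,m})_{m\in\mathbb{N}}$ be a sequence in $(\mathcal{X}\times\mathcal{Y})^n$
which converges to some $D_{n,0} \in (\mathcal{X}\times\mathcal{Y})^n$. Then, the corresponding sequence
of empirical measures $(P_{D_{n,m}})_{m\in\mathbb{N}}$ weakly converges in $\mathcal{M}_1(\mathcal{X}\times\mathcal{Y})$ to $P_{D_{n,0}}$.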
Therefore, the statement follows from Theorem 3.2 and Q. □ 
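The weak convergence of the empirical measures used in this proof can be illustrated numerically: if the data points converge, then $P_{D_{n,m}}\, g \to P_{D_{n,0}}\, g$ for every bounded continuous $g$. The following is a minimal sketch with a hypothetical data set of size $n = 3$, a hypothetical perturbation of order $1/m$ and one hypothetical test function $g$; it illustrates the statement and does not prove it.

```python
import math


def empirical_integral(data, g):
    """P_D g: the integral of g with respect to the empirical measure of the data set."""
    return sum(g(x, y) for x, y in data) / len(data)


def g(x, y):
    """A bounded continuous test function on R x R (chosen arbitrarily for illustration)."""
    return math.cos(x) / (1.0 + y * y)


# Limit data set D_{n,0} with n = 3 (hypothetical values).
d0 = [(0.0, 1.0), (2.0, -1.0), (5.0, 0.5)]


def perturbed(m):
    """Data set D_{n,m} converging to d0 as m grows (hypothetical perturbation)."""
    return [(x + 1.0 / m, y - 1.0 / m) for x, y in d0]


# |P_{D_{n,m}} g - P_{D_{n,0}} g| for increasing m.
gaps = [abs(empirical_integral(perturbed(m), g) - empirical_integral(d0, g))
        for m in (1, 10, 100, 1000)]
```

Inspecting `gaps` shows the difference of the integrals shrinking as `m` grows, which is what weak convergence requires for this particular `g`.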



Based on Cuevas (1988), the main theorem essentially is a consequence
of Theorem 3.2.

Proof of Theorem 3.1: According to Corollary 3.4, the SVM-estimator

\[ S_n : (\mathcal{X}\times\mathcal{Y})^n \to H, \qquad D_n \mapsto f_{L,D_n,\lambda} \]

is continuous and, therefore, measurable with respect to the Borel-$\sigma$-algebras
for every $n \in \mathbb{N}$. The mapping

\[ S : \mathcal{M}_1(\mathcal{X}\times\mathcal{Y}) \to H, \qquad P \mapsto f_{L^*,P,\lambda} \]

is a continuous functional due to Theorem 3.2. Furthermore,

\[ S_n(D_n) = S(P_{D_n}) \qquad \forall\, D_n \in (\mathcal{X}\times\mathcal{Y})^n, \ \forall\, n \in \mathbb{N} . \]

As already mentioned in Section 2, $H$ is a separable Hilbert space and,
therefore, a Polish space. Hence, the sequence of SVM-estimators $(S_n)_{n\in\mathbb{N}}$
is qualitatively robust according to Cuevas (1988, Theorem 2). □

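Proof of Corollary 3.6: Let $\mathbb{P}_{\mathbf{D}_n}$ denote the function which maps $\omega \in \Omega$
to the empirical measure $\frac{1}{n}\sum_{i=1}^n \delta_{(X_i(\omega),Y_i(\omega))}$. According to Varadarajan's
Theorem (Dudley (1989, Theorem 11.4.1)), there is a set $N \in \mathcal{A}$ such
that $Q(N) = 0$ and $\mathbb{P}_{\mathbf{D}_n(\omega)}$ weakly converges to $P$ for every $\omega \in \Omega \setminus N$.
Then, Theorem 3.2 implies

\[ \lim_{n\to\infty} \|f_{L^*,\mathbf{D}_n(\omega),\lambda} - f_{L^*,P,\lambda}\|_H = \lim_{n\to\infty} \|S(\mathbb{P}_{\mathbf{D}_n(\omega)}) - S(P)\|_H = 0 \]

for every $\omega \in \Omega \setminus N$. This proves (a) and, due to Steinwart and Christmann
(2008, Lemma 4.28), (b). The Lipschitz continuity of $L^*$ implies

\begin{align*}
\bigl|\mathcal{R}_{L^*,P}(f_{L^*,\mathbf{D}_n(\omega),\lambda}) - \mathcal{R}_{L^*,P}(f_{L^*,P,\lambda})\bigr|
&= \Bigl| \int L(x,y,f_{L^*,\mathbf{D}_n(\omega),\lambda}(x)) - L(x,y,f_{L^*,P,\lambda}(x)) \, P(d(x,y)) \Bigr| \\
&\le \int \sup_{x',y'} \bigl| L(x',y',f_{L^*,\mathbf{D}_n(\omega),\lambda}(x)) - L(x',y',f_{L^*,P,\lambda}(x)) \bigr| \, P(d(x,y)) \\
&\le |L|_1 \int \bigl| f_{L^*,\mathbf{D}_n(\omega),\lambda}(x) - f_{L^*,P,\lambda}(x) \bigr| \, P(d(x,y)) \\
&\le |L|_1 \cdot \| f_{L^*,\mathbf{D}_n(\omega),\lambda} - f_{L^*,P,\lambda} \|_\infty
\end{align*}

for every $\omega \in \Omega$. According to (b), the last term converges to 0 for $Q$-almost
every $\omega \in \Omega$ and this implies (d). Finally, (c) follows from (a) and
(d).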

If $f_{L,P,\lambda}$ exists, then $f_{L^*,P,\lambda}$ is equal to $f_{L,P,\lambda}$ (Theorem 2.1). In particular,
there is an $f \in H$ such that $(x,y) \mapsto L(x,y,f(x))$ is $P$-integrable. Since
Lipschitz continuity of $L$ and $H \subset C_b(\mathcal{X})$ (see Steinwart and Christmann
(2008, Lemma 4.28)) imply $P$-integrability of $(x,y) \mapsto L^*(x,y,f(x)) =
L(x,y,f(x)) - L(x,y,0)$, we get that $(x,y) \mapsto L(x,y,0)$ is also $P$-integrable.
Therefore, $\mathcal{R}_{L^*,P}(f)$ is equal to $\mathcal{R}_{L,P}(f) - \mathcal{R}_{L,P}(0)$ for every $f \in H$, and
$\mathcal{R}_{L,P}(0)$ is a finite constant which does not depend on $f$. Furthermore,
$f_{L^*,D_n,\lambda} = f_{L,D_n,\lambda}$ for every $D_n \in (\mathcal{X}\times\mathcal{Y})^n$; see Section 2. Hence, the
original assertions (a)-(d) for $L^*$ turn into the corresponding assertions for
$L$ instead of $L^*$. □



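Proof of Theorem: If $\Psi = 0$, the statement is true. Assume
$\Psi \ne 0$ now and assume that the statement of the theorem is not true. Then,
there is an $\varepsilon > 0$ and a subsequence $(P_{n_\ell})_{\ell\in\mathbb{N}}$ such that

\[ \Bigl\| \int \Psi \, dP_{n_\ell} - \int \Psi \, dP_0 \Bigr\|_H > \varepsilon \qquad \forall\, \ell \in \mathbb{N} . \tag{32} \]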



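Since the sequence $(P_n)_{n\in\mathbb{N}}$ weakly converges to $P_0$, it is uniformly tight;
see, e.g., (Dudley, 1989, Theorem 11.5.3). That is, there is a compact subset
$K \subset \mathcal{Z}$ such that

\[ P_n(\mathcal{Z}\setminus K) < \frac{\varepsilon}{4 \sup_{z} \|\Psi(z)\|_H} \qquad \forall\, n \in \mathbb{N} . \tag{33} \]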



For every $\ell \in \mathbb{N}$, let $\tilde P_{n_\ell}$ denote the restriction of $P_{n_\ell}$ to the Borel-$\sigma$-algebra
$\mathfrak{B}(K)$ of $K$. Let $\tilde\Psi$ denote the restriction of $\Psi$ to $K$. Since $K$ is a compact
Polish space, the set $\mathcal{M}(K)$ of all finite signed measures on $\mathfrak{B}(K)$ is the
dual space of $C(K)$ (the set of all continuous functions $f : K \to \mathbb{R}$); see e.g.
(Dudley, 1989, Theorem 7.1.1 and 7.4.1). Accordingly, $\mathcal{M}(K)$ is precisely
the set of all (real) measures in the sense of (Bourbaki, 2004, Section III.1);
see also (Bourbaki, 2004, Subsection III.1.5 and III.1.8). Since $(\tilde P_{n_\ell})_{\ell\in\mathbb{N}}$ is
relatively compact in the vague topology of $\mathcal{M}(K)$ (Bourbaki, 2004,
Subsection III.1.9), we may assume without loss of generality that $(\tilde P_{n_\ell})_{\ell\in\mathbb{N}}$
vaguely converges to some positive finite measure $P_0'$. (Otherwise, we may
replace $(\tilde P_{n_\ell})_{\ell\in\mathbb{N}}$ by a further subsequence.) According to (Bourbaki, 2004,
p. III.40), vague convergence implies

\[ \lim_{\ell\to\infty} \Bigl\| \int \tilde\Psi \, d\tilde P_{n_\ell} - \int \tilde\Psi \, dP_0' \Bigr\|_H = 0 \tag{34} \]

for Pettis and Bochner integrals (since $H$ is assumed to be a separable
Banach space, Pettis integrals and Bochner integrals coincide; see e.g. (Dudley,
1989, p. 150)).

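Let $H^*$ be the dual space of $H$. Note that $F \circ \Psi$ is continuous and
bounded on $\mathcal{Z}$ for every $F \in H^*$. Hence, it follows from weak convergence
of $(P_{n_\ell})_{\ell\in\mathbb{N}}$ to $P_0$ and a property of the Bochner integral (Denkowski et al.,
2003, Theorem 3.10.16) that

\[ \lim_{\ell\to\infty} F\Bigl(\int \Psi \, dP_{n_\ell}\Bigr) = \lim_{\ell\to\infty} \int F \circ \Psi \, dP_{n_\ell} = \int F \circ \Psi \, dP_0 = F\Bigl(\int \Psi \, dP_0\Bigr) . \]

Accordingly, vague convergence of $(\tilde P_{n_\ell})_{\ell\in\mathbb{N}}$ to $P_0'$ implies
$\lim_{\ell\to\infty} F\bigl(\int \tilde\Psi \, d\tilde P_{n_\ell}\bigr) = F\bigl(\int \tilde\Psi \, dP_0'\bigr)$. Hence,

\[ \lim_{\ell\to\infty} F\Bigl(\int \Psi \, dP_{n_\ell} - \int \tilde\Psi \, d\tilde P_{n_\ell}\Bigr) = F\Bigl(\int \Psi \, dP_0 - \int \tilde\Psi \, dP_0'\Bigr) . \tag{35} \]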


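For every $\ell \in \mathbb{N}$,

\[ \Bigl\| \int \Psi \, dP_{n_\ell} - \int \tilde\Psi \, d\tilde P_{n_\ell} \Bigr\|_H = \Bigl\| \int_{\mathcal{Z}\setminus K} \Psi \, dP_{n_\ell} \Bigr\|_H \le \int_{\mathcal{Z}\setminus K} \|\Psi\|_H \, dP_{n_\ell} \overset{(33)}{<} \frac{\varepsilon}{4} . \tag{36} \]

For every $\ell \in \mathbb{N}$ and every $F \in H^*$ such that $\|F\|_{H^*} \le 1$, (36) implies
$\bigl|F\bigl(\int \Psi \, dP_{n_\ell} - \int \tilde\Psi \, d\tilde P_{n_\ell}\bigr)\bigr| < \frac{\varepsilon}{4}$ and, because of (35), also
$\bigl|F\bigl(\int \tilde\Psi \, dP_0' - \int \Psi \, dP_0\bigr)\bigr| \le \frac{\varepsilon}{4}$. Hence, it follows from (Dunford and Schwartz, 1958,
Corollary II.3.15) that

\[ \Bigl\| \int \tilde\Psi \, dP_0' - \int \Psi \, dP_0 \Bigr\|_H \le \frac{\varepsilon}{4} . \tag{37} \]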



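By using the triangle inequality, we obtain

\begin{align*}
\Bigl\| \int \Psi \, dP_{n_\ell} - \int \Psi \, dP_0 \Bigr\|_H
\le{} & \Bigl\| \int \Psi \, dP_{n_\ell} - \int \tilde\Psi \, d\tilde P_{n_\ell} \Bigr\|_H
+ \Bigl\| \int \tilde\Psi \, d\tilde P_{n_\ell} - \int \tilde\Psi \, dP_0' \Bigr\|_H \\
& + \Bigl\| \int \tilde\Psi \, dP_0' - \int \Psi \, dP_0 \Bigr\|_H
\end{align*}

so that (34), (36) and (37) imply $\limsup_{\ell\to\infty} \bigl\| \int \Psi \, dP_{n_\ell} - \int \Psi \, dP_0 \bigr\|_H \le \frac{\varepsilon}{2}$.
This is a contradiction to (32). □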



Proof of Proposition 5.2: Without loss of generality, we may assume
that

\[ f(x_0) = 0 \quad\text{and}\quad f(x_1) = 1 . \tag{38} \]

(Otherwise, we can divide $f$ by $f(x_1)$.) Since the function $\mathbb{R} \to [0,\infty)$, $t \mapsto
L(x_1,1,t)$ is convex, it is also continuous. Therefore, (7) implies the existence
of a $\gamma \in (0,1)$ such that

\[ L(x_1,1,\gamma) > 0 . \tag{39} \]

Note that convexity of the loss function, $L(x_1,1,1) = 0$ and $L(x_1,1,\gamma) > 0$
imply

\[ L(x_1,1,1) \le L(x_1,1,t) \le L(x_1,1,\gamma) \le L(x_1,1,s) \tag{40} \]



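for $0 \le s \le \gamma \le t \le 1$. Define $P_0 := \delta_{(x_0,0)}$. Since $f_{L,\delta_{(x_0,0)},\lambda_n} = 0$, it
follows that

\[ P_0^n\bigl(\{D_n \in (\mathcal{X}\times\mathcal{Y})^n \mid f_{L,D_n,\lambda_n} = 0\}\bigr) = 1 . \tag{41} \]

Next, fix any $\varepsilon \in (0,1)$ and define the mixture distribution

\[ P_\varepsilon := (1-\varepsilon)P_0 + \varepsilon\,\delta_{(x_1,1)} = (1-\varepsilon)\,\delta_{(x_0,0)} + \varepsilon\,\delta_{(x_1,1)} . \]

For every $n \in \mathbb{N}$, let $Z_n'$ be the subset of $(\mathcal{X}\times\mathcal{Y})^n$ which consists of all
those elements $D_n = (D_n^{(1)}, \dots, D_n^{(n)}) \in (\mathcal{X}\times\mathcal{Y})^n$ where

\[ D_n^{(i)} \in \{(x_0,0), (x_1,1)\} \qquad \forall\, i \in \{1,\dots,n\} . \]

In addition, let $Z_n''$ be the subset of $(\mathcal{X}\times\mathcal{Y})^n$ which consists of all those
elements $D_n = (D_n^{(1)}, \dots, D_n^{(n)}) \in (\mathcal{X}\times\mathcal{Y})^n$ where

\[ \sharp\bigl(\{i \in \{1,\dots,n\} \mid D_n^{(i)} = (x_1,1)\}\bigr) \ge \frac{\varepsilon n}{2} . \tag{42} \]

Define $Z_n := Z_n' \cap Z_n''$. Then, we have $P_\varepsilon^n(Z_n') = 1$ and, according to the
law of large numbers (Dudley (1989, Theorem 8.3.5)), $\lim_{n\to\infty} P_\varepsilon^n(Z_n'') = 1$.
Hence, there is an $n_{\varepsilon,1} \in \mathbb{N}$ such that

\[ P_\varepsilon^n(Z_n) \ge \frac{1}{2} \qquad \forall\, n \ge n_{\varepsilon,1} . \tag{43} \]

Due to $\lim_{n\to\infty} \lambda_n = 0$ and (39), there is an $n_{\varepsilon,2} \in \mathbb{N}$ such that

\[ \lambda_n \|f\|_H^2 < \frac{\varepsilon}{2}\, L(x_1,1,\gamma) \qquad \forall\, n \ge n_{\varepsilon,2} . \tag{44} \]

In the following, we show

\[ f_{L,D_n,\lambda_n}(x_1) > \gamma \qquad \forall\, D_n \in Z_n, \ \forall\, n \ge n_{\varepsilon,2} . \tag{45} \]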



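To this end, fix any $D_n \in Z_n$. In order to prove (45), it is enough to show
the following assertion for every $n \ge n_{\varepsilon,2}$:

\[ \tilde f \in H, \ \tilde f(x_1) \le \gamma \quad\Longrightarrow\quad \mathcal{R}_{L,D_n,\lambda_n}(f) < \mathcal{R}_{L,D_n,\lambda_n}(\tilde f) . \tag{46} \]

The definition of $Z_n$ and (38) imply

\[ \mathcal{R}_{L,D_n,\lambda_n}(f) = \mathcal{R}_{L,D_n}(f) + \lambda_n\|f\|_H^2 = \lambda_n\|f\|_H^2 . \]

For every $\tilde f \in H$ such that $\tilde f(x_1) \le \gamma$, the definition of $Z_n$ implies

\[ \mathcal{R}_{L,D_n,\lambda_n}(\tilde f) \ge \mathcal{R}_{L,D_n}(\tilde f) \overset{(42)}{\ge} \frac{\varepsilon}{2}\, L(x_1,1,\tilde f(x_1)) \overset{(40)}{\ge} \frac{\varepsilon}{2}\, L(x_1,1,\gamma) . \]

Hence, (46) follows from (44) and, therefore, we have proven (45).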

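Define $n_\varepsilon = \max\{n_{\varepsilon,1}, n_{\varepsilon,2}\}$. By assumption, $k$ is a bounded, non-zero
kernel. According to Steinwart and Christmann (2008, Lemma 4.23), this
implies

\[ \|f_{L,D_n,\lambda_n}\|_H \ge \frac{\|f_{L,D_n,\lambda_n}\|_\infty}{\|k\|_\infty} \qquad \forall\, D_n \in Z_n, \ \forall\, n \ge n_\varepsilon \]

and, therefore,

\[ \|f_{L,D_n,\lambda_n}\|_H \ge \min\Bigl\{\frac{\gamma}{\|k\|_\infty}, 1\Bigr\} =: c \qquad \forall\, D_n \in Z_n, \ \forall\, n \ge n_\varepsilon . \tag{47} \]

Define $F := \{f \in H \mid \|f\|_H \ge c\}$ and

\[ F^{c/2} := \Bigl\{f \in H \Bigm| \inf_{f'\in F} \|f - f'\|_H < \frac{c}{2}\Bigr\} \subset \{f \in H \mid \|f\|_H > 0\} . \tag{48} \]

Hence, for every $n \ge n_\varepsilon$, we obtain

\begin{align*}
[T_n(P_\varepsilon^n)](F) &= P_\varepsilon^n\bigl(\{D_n \mid \|f_{L,D_n,\lambda_n}\|_H \ge c\}\bigr) \overset{(47)}{\ge} P_\varepsilon^n(Z_n) \overset{(43)}{\ge} \frac{1}{2} \\
&\overset{(41)}{=} [T_n(P_0^n)]\bigl(\{f \in H \mid \|f\|_H > 0\}\bigr) + \frac{1}{2}
\overset{(48)}{\ge} [T_n(P_0^n)](F^{c/2}) + \frac{c}{2} .
\end{align*}

According to the definition of the Prokhorov distance (see Subsection 5.1),
it follows that

\[ \sup_{n\in\mathbb{N}} d_{\mathrm{Pro}}\bigl(T_n(P_\varepsilon^n), T_n(P_0^n)\bigr) \ge \frac{c}{2} . \tag{49} \]

In addition, we have $d_{\mathrm{Pro}}(P_0, P_\varepsilon) \le \varepsilon$ because $P_\varepsilon$ is an $\varepsilon$-mixture of $P_0$.
Since $c > 0$ does not depend on $\varepsilon \in (0,1)$ and $\varepsilon$ may be arbitrarily small,
this proves that $(T_n)_{n\in\mathbb{N}}$ is not qualitatively robust in $P_0$. □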
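The law-of-large-numbers step for the contaminated samples can be checked by simulation. The sketch below draws samples from the mixture $P_\varepsilon$ and estimates the probability that at least $\varepsilon n/2$ observations equal $(x_1,1)$; the threshold $\varepsilon n/2$ is an assumption about the exact form of condition (42), and the function names, the seed and the number of repetitions are hypothetical choices made for this illustration.

```python
import random


def sample_mixture(n, eps, rng):
    """Draw n observations from P_eps; 1 encodes (x_1, 1) and 0 encodes (x_0, 0)."""
    return [1 if rng.random() < eps else 0 for _ in range(n)]


def enough_contamination(sample, eps):
    """Assumed form of condition (42): at least eps * n / 2 observations equal (x_1, 1)."""
    return sum(sample) >= eps * len(sample) / 2.0


def estimate_probability(n, eps, repetitions=200, seed=0):
    """Monte Carlo estimate of the P_eps^n-probability of the event in condition (42)."""
    rng = random.Random(seed)
    hits = sum(enough_contamination(sample_mixture(n, eps, rng), eps)
               for _ in range(repetitions))
    return hits / repetitions


eps = 0.1
probabilities = {n: estimate_probability(n, eps) for n in (10, 100, 1000)}
```

The estimated probabilities approach 1 as $n$ grows, so the estimator is evaluated on contaminated samples with probability bounded away from zero, no matter how small the contamination level $\varepsilon$ is.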



References 

P. L. Bartlett, M. I. Jordan, and J. D. McAuliffe. Convexity, classification, 
and risk bounds. Journal of the American Statistical Association, 101: 
138-156, 2006. 

P. Billingsley. Convergence of probability measures. John Wiley & Sons, 
New York, 1968. 

N. Bourbaki. Integration. I. Chapters 1-6. Springer-Verlag, Berlin, 2004. 
Translated from the 1959, 1965 and 1967 French originals by Sterling K. 
Berberian. 

O. Bousquet and A. Elisseeff. Stability and generalization. Journal of Ma- 
chine Learning Research, 2:499-526, 2002. 

A. Christmann and I. Steinwart. Consistency and robustness of kernel-based 
regression in convex risk minimization. Bernoulli, 13(3):799-819, 2007. 






A. Christmann and A. Van Messem. Bouligand derivatives and robustness 
of support vector machines for regression. Journal of Machine Learning 
Research, 9:915-936, 2008. 

A. Christmann, A. Van Messem, and I. Steinwart. On consistency and 
robustness properties of support vector machines for heavy-tailed distri- 
butions. Statistics and Its Interface, 2:311-327, 2009. 

J. B. Conway. A course in functional analysis. Springer-Verlag, New York,
1985.

A. Cuevas. Qualitative robustness in abstract inference. Journal of Statis- 
tical Planning and Inference, 18:277-289, 1988. 

E. De Vito, L. Rosasco, A. Caponnetto, M. Piana, and A. Verri. Some 
properties of regularized kernel methods. Journal of Machine Learning 
Research, 5:1363-1390, 2004. 

Z. Denkowski, S. Migorski, and N. Papageorgiou. An introduction to non- 
linear analysis: Theory. Kluwer Academic Publishers, Boston, 2003. 

R. Dudley. Real analysis and probability. Wadsworth & Brooks/Cole Ad- 
vanced Books & Software, Pacific Grove, CA, 1989. 

N. Dunford and J. Schwartz. Linear operators. I. General theory. Wiley- 
Interscience Publishers, New York, 1958. 

J. Hadamard. Sur les problèmes aux dérivées partielles et leur signification
physique. Princeton University Bulletin, 13:49-52, 1902.

F. R. Hampel. Contributions to the theory of robust estimation. PhD thesis, 
University of California, Berkeley, 1968. 

F. R. Hampel. A general qualitative definition of robustness. Annals of 
Mathematical Statistics, 42:1887-1896, 1971. 

P. J. Huber. Robust estimation of a location parameter. Annals of Mathe- 
matical Statistics, 35:73-101, 1964. 

P. J. Huber. The behavior of maximum likelihood estimates under non- 
standard conditions. In Proceedings of the Fifth Berkeley Symposium on 
Mathematical Statistics and Probability, Vol. I: Statistics, pages 221-233. 
University of California Press, Berkeley, 1967.

P. J. Huber. Robust statistics. John Wiley & Sons, New York, 1981. 

D. Pollard. A user's guide to measure theoretic probability. Cambridge 
University Press, Cambridge, 2002. 






B. Schölkopf and A. J. Smola. Learning with kernels. MIT Press, Cambridge,
2002.

I. Steinwart. Support vector machines are universally consistent. Journal of 
Complexity, 18:768-791, 2002. 

I. Steinwart. Sparseness of support vector machines. Journal of Machine 
Learning Research, 4:1071-1105, 2003. 

I. Steinwart. Consistency of support vector machines and other regularized 
kernel classifiers. IEEE Transactions on Information Theory, 51:128-142, 
2005. 

I. Steinwart and M. Anghel. Consistency of support vector machines for 
forecasting the evolution of an unknown ergodic dynamical system from 
observations with unknown noise. Annals of Statistics, 37:841-875, 2009. 

I. Steinwart and A. Christmann. Support vector machines. Springer, New 
York, 2008. 

J. Tukey. A survey of sampling from contaminated distributions. In Contri- 
butions to probability and statistics, pages 448-485. Stanford Univ. Press, 
Stanford, Calif., 1960. 

A. van der Vaart. Asymptotic statistics. Cambridge University Press, Cam- 
bridge, 1998. 

V. N. Vapnik. Statistical learning theory. John Wiley & Sons, New York, 
1998. 

T. Zhang. Convergence of large margin separable linear classification. In 
T. K. Leen, T. G. Dietterich, and V. Tresp, editors, Advances in Neural In- 
formation Processing Systems 13, pages 357-363. MIT Press, Cambridge, 
MA, 2001. 

T. Zhang. Statistical behavior and consistency of classification methods 
based on convex risk minimization. Annals of Statistics, 32:56-85, 2004. 






